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BOOTSTRAP INFERENCE FOR NETWORK CONSTRUCTION 

By Shuang , Li Hsut, Jie Peng^ and Pei Wang*'^' 

Fred Hutchinson Cancer Research Center"^ and University of California, Davis ^ 
Regularization techniques are widely used for tackling high- 
dimension-low-sample-size problems. Yet, finding the right amount 
of regularization can be challenging, especially in the unsupervised 
setting such as structure learning problems where traditional meth- 
ods such as BIG or cross-validation often do not work well. In this 
paper, we propose a new method — Bootstrap Inference for Network 
construction (BINCO) — to infer networks by directly controlling 
the false discovery rates (FDRs) of the selected edges. This method 
utilizes the idea of model aggregation. It fits a mixture model for the 
distribution of edge selection frequencies to estimate the FDRs. As 
this method only depends on selection frequencies, it is applicable to 
a wide range of applications beyond network construction. 

1. Introduction. 

Many modern datasets fall into the high-dimension-low-sample-size paradigm, where the number 
of features is much larger than the number of observations. Deriving interpretable results from such 
data is of keen interests in many scientific areas. At issue is how to select relevant features or struc- 
tures with limited sample sizes. Regularization on the model and/or parameter space thus becomes 
an indispensable strategy in order to achieve model selection and parameter estimation consistency. 
Commonly used regularization schemes include the Lq penalty such as the ridge (Hoerl and Ken- 
nard, 1970), Lasso (Tibshirani, 1996) and Bridge (Frank and Friedman, 1993; Fu, 1998; Knight and 
Fu, 2000; Liu et al., 2007; Huang et ai, 2008). Other popular penalties include the SCAD (Fan and 
Li, 2001) and elastic net (Zou and Hastie, 2005), among many others. A crucial but very challenging 
issue for all regularized procedures is to choose the proper amount of regularization. Many works 
have been done in the past years on the conditions under which consistent model selection can be 

*The correspondent author 

AMS 2000 subject classifications: Primary 62F40; secondary 62H20 

Keywords and phrases: high dimensional data, GGM, model aggregation, mixture model, FDR 

1 

imsart-aoas ver. 2009/08/13 file: BINCO_arXiv.tex date: November 23, 2011 



2 S. LI ET AL. 

achieved (e.g., Knight and Pu, 2000; Zhao and Yu, 2006; Meinshausen and Biihlmann, 2006; Zou, 
2006; Yuan and Lin, 2007; Wainwright 2009). However, the problem of finding the right amount of 
regularization with finite data is still not satisfactorily solved. 

Traditional methods such as Bayes information criteria (Schwarz, 1978) and cross-validation 
aim to find a model that minimizes prediction error or maximizes a targeted likelihood function. 
These prediction-criterion-based strategies may not be appropriate when the goal is to identify the 
relevant variable set such as in regression and classification and they are even more problematic 
for unsupervised learning such as network structure inference. This is because such criteria tend to 
include many irrelevant variables/features as observed by many authors, e.g., Efron (2004b), Efron 
et al. (2004), Meinshausen and Biihlmann (2006) and Peng et aZ.(2010). 

Recently, various model aggregation methods have been proposed to be coupled with regulariza- 
tion procedures and some of them also provide certain control of false positive findings. For example. 
Bach (2008) proposed Bolasso, which chooses variables that are selected by all the lasso models 
built on bootstrapped datasets. It was shown that Bolasso can achieve variable selection consistency 
in the regime where the original lasso procedure is inconsistent. In the context of network recon- 
struction, Peng et al. (2009) proposed to choose edges which are consistently selected across at least 
half of the cross validation folds. This strategy was shown by simulation experiments to be effective 
in reducing false edge detection without sacrificing too much power. More recently, Meinshausen 
and Biihlmann (2010) proposed the stability selection procedure which chooses variables such that 
their maximum selection frequencies built on perturbed data over a range of regularizations exceed 
a threshold. Under suitable conditions, they derived an upper bound for the expected number of 
false positives of such procedure, providing a useful guidance on choosing the cut-off value for the 
maximum selection frequencies. In the same paper, they also proposed the randomized lasso penalty 
which is another model aggregation strategy where the regularization parameter is perturbed. Com- 
bined with stability selection, it was shown to achieve model selection consistency without requiring 
the irrepresentable condition which is needed for lasso model selection consistency. In another re- 
lated work, Wang et al. (2011) proposed a similar modified lasso regression — random lasso — 
by aggregating models based on bootstrap samples and random subsets of variables, which was 
shown to give better results in terms of variable selection and prediction error compared to original 
lasso regression. All these works have greatly advanced the research in model selection in the high 
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dimensional regime. However, none provide direct estimation of the false positive rate (FDR) for 
the resulting variable selection set. 

In this paper, wc focus on the high-dimensional Gaussian graphical model inference, which has 
been of great interest in biomedical and social science research lately. It is an unsupervised learning 
problem, for which determining the proper amount of regularization is more challenging than su- 
pervised learning problems such as regression and classification. To tackle this problem, we propose 
to infer the network via model aggregation. In a spirit similar to the aforementioned methods, our 
method makes use of selection frequencies from a collection of models built through perturbing the 
data as well as the regularization parameter. These selection frequencies are modeled by a mixture 
distribution. An estimate of FDR on edge selection is then obtained which is in turn used as a cri- 
terion for determining the cutoff threshold for the selection frequencies and the proper amount of 
regularization. The framework of our method is rather general, as it only depends on the empirical 
distribution of selection frequencies. Thus, it can be applied to a wide range of problems beyond 
network construction. 

The rest of the paper is organized as follows. In Section 2, we describe in detail the proposed 
method. In Section 3, an extensive simulation study is conducted to evaluate the performance 
of the proposed method and the comparison with the stability selection procedure. In Section 4, 
the method is applied to a microarray expression data set of breast cancer to build a genetic 
interaction network. The paper is concluded with a final summary in Section 5. Proofs are given in 
the Appendix and additional results (figures and tables labeled by S-1, S-2, etc.) are provided in 
the Supplementary Materials. 

2. Method. 

2.1. Gaussian Graphical Models. 

In Gaussian Graphical Models (GGM), network construction is derived from the conditional 
dependence structure among the random variables. Let Y = (Yi, . . . ,Yp) denote a p-dimension 
random vector following a multivariate normal distribution N{0,Ti), where S is a p x jo positive 
definite matrix. The conditional dependence structure among Y is represented by an undirected 
graph G = {U, E) with vertices U = {1,2, . . . ,p} representing Yi, . . . , and the edge set E defined 
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4 S. LI ET AL. 

as 

E = : Yi and Yj are dependent given I < i,j < p} 

where = {Yk : k ^ 1 < k < p}. Then the goal is to identify the edge set E. Note that 

under normaUty assumption, the conditional independence between Yj, and Yj is equivalent to the 
partial correlation between Y^ and Yj being zero. It is also equivalent to the entry of the 
concentration matrix being zero, i.e., Cij = (Il^^)jj = (Dempster, 1972; Cox and Wermuth, 
1996). 

There are two main types of approaches to fitting a GGM. One is the maximum-likelihood-based 
approach which estimates the concentration matrix directly. The other is the regression-based 
approach, which fits the GGM through identifying non-zero regression coefficients of the following 
regression 

Yi = ^PijYj + €i, l<i<p, 

where Cj is uncorrelated with Y-i = {Yk,k ^ i,l < k < p} and Pij = —Cij/ca. Thus, non-zero 
/3jj's correspond to non-zero entries in the concentration matrix. In both approaches, there arc 
0{p^) parameters to estimate, which requires proper regularization on the model if p is larger 
than the sample size n. This can be achieved by making a sparsity assumption on the network 
structure, i.e., assuming that most pairs of variables are indeed conditionally independent given 
all other variables. Such an assumption is reasonable for many real life networks including the 
genetic regulatory networks (Gardner, et al. 2003; Jeong, et al. 2001; Tegner, et al. 2003). Methods 
have been developed along this line by using Li regularization. For example. Yuan and Lin (2007) 
proposed a sparse estimator of the concentration matrix via maximizing the Li penalized log- 
likelihood. Efficient algorithms were subsequently developed to fit this model with high dimensional 
data (Friedman, et al. 2007, Rothman, et al., 2008). For regression-based approaches, Meinshausen 
and Biihlmann (2006) considered the neighborhood selection estimator by minimizing p individual 
loss functions 

(2.1) L^^{P,Y) = \\\Y^-Y, f +A I/^^J-I' ^ = 1' • • • '^'^ 
while Peng, et al. (2009) proposed the space algorithm by minimizing the joint loss 

(2.2) L{Y,e) = lQ2\\Y,-Y,J^p^JyJf} + >^ E ift^i- 

i=l j-j¥=i i<i<j<P 
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Prom objective functions (2.1) and (2.2), it is clear that the selected edge set depends on the 
regularization parameter A. Since the goal here is to recover the true edge set, ideally A should 
be determined based on consideration such as FDR and power with respect to edge selection. 
Moreover, as the sample size is limited, a model-aggregation-based strategy can further improve 
the selection result compared to simply tuning the regularization parameter. Thus, in the following 
section, we introduce a new model-aggregation-based procedure which selects models based on 
directly estimating FDRs of the selected edges. 

To be consistent, we stay in the context of learning GGMs in the remaining of this paper. We 
refer to the set of all pairs of variables as the candidate edge set (denoted by 0,), the subset of 
those edges in the true model as the true edge set (denoted by E) and the rest as the null edge 
set (denoted by E'^). Note that Q, = EUE'^ and the total number of edges in J7 is Nq = p{p— l)/2. 

2.2. Model Aggregation. 

Consider a network construction procedure which is good in the sense that the true edges are 
stochastically more likely to be selected than the null edges. Then it would be reasonable to choose 
edges which have high selection probabilities. In practice, these selection probabilities can be esti- 
mated by the selection frequencies over networks constructed based on perturbed data sets. As a 
result, selecting edges with large selection frequencies can be a powerful procedure. In the following, 
we formalize this idea. 

Let ^(A) be an edge selection procedure with a regularization parameter A and S^{Y) = 
S'^{A{X),Y) be the set of selected edges by applying A{X) to data Y. The selection probability 
of edge is defined as 

Pij = E(^I{ii,j)eS\Y)}), 

where /{•} is the indicator function. Let R(Y) be the space of resamples from Y (e.g., through 
bootstrapping or subsampling) . For a random resample Y' from R{Y), we further define 

= E (i{(i, j) G s\Y')}) =e{e (/{(z, j) g s\y')} I y)) . 

In many cases (see Part C in the Supplementary Material), Pi/s and pij^s are close. Then pi/s can 
be estimated by the selection frequencies based on networks built on a large number (say B) of 
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6 S. LI ET AL. 

resamples: 

(2.3) X^j ^ X,j {A{X); Y\...,Y^e R{Y)) = ;^ E I { (^i) e S\y')} . 

k=l 

A general form of the aggregation-based procedures which select edges of large selection frequen- 
cies is 

S^ = {{i,j):X>^j>c}, force (0,1]. 

is reasonable as long as most true edges have selection frequencies greater than or equal to c 
and most null edges have selection frequencies less than c. Ideally we want to find a threshold c 
satisfying 

(2.4) n {^.'>c}[n< 




Pi {X^j < c} > ^ 1 as n ^ oo, 



so that the corresponding procedure is consistent, i.e., Pr{S^ = E) ^ 1. In fact, if A{\) is 
selection consistent and pij — pij — )■ 0, then 

(2.5) Pril fl {XA = 1} i n I fl {XA = 0} i I ^ 1 as n ^ oo, 

and thus any c G (0, 1] satisfies (2.4). The advantage of aggregation-based procedures can be seen 
as (2.4) is in general a much weaker condition than (2.5), which makes it possible to construct a 
consistent even when A{\) is not consistent. 

The following is a simulation example where an aggregation-based procedure is reasonable and 
a better choice over the original procedure. Figure 1 shows the distribution of selection frequencies 
based on a simulated data (see Section 3 for more details). Prom Figure 1(b), we can see that most 
null edges (represented by red crosses) have low selection frequencies (< 0.4), while most true edges 
(represented by green triangles) have large selection frequencies (> 0.6). This suggests that with a 
properly chosen c (say, c G [0.4,0.6]), can include a large proportion of true edges but only a 
small number of null edges. In fact, if A is in a reasonable range, choosing c = 0.5 results in a 
which outperforms A{X) in terms of both FDR and power (Figure 2). 
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Empirical Edge Selection Frequency Distribution 
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True and Null Edge Selection Frequency Distributions 
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Fig 1: The distributions of selection frequencies based on a simulated data set. (a) The distribution of selection frequencies 
across all edges, (b) Distributions of selection frequencies across null and true edges, respectively (note that, these are not 
observable in practice). Simulation is based on a powrer-lawr network with p=500, n=200, and the number of true edges is 495. 
The space algorithm with A=135 is used as the original non-aggregation procedure A(A). For illustrating the tail behavior of 
these distributions more effectively, we only show them on the selection frequency range [0.06,1], as there are many edges with 
selection frequency less than 0.06. 



ROC 




0.05 0.10 0.15 0.20 

FDR 



Fig 2: Receiver operating characteristic (ROC) plot for the aggregation-based procedure with c = 0.5 and the original 
procedure A{\) on a range of A S (96, 160) with the rest of settings the same as in Figure 1. 



2.3. Selection Frequency Modeling. 

Now we introduce a mixture model, similar in spirit to Efron (2004a), for estimating the FDR of 
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an aggregation-based procedure S^. We can then use this estimate to choose the optimal c and A by 
controUing FDR while maximizing power. Assume that the selection frequencies {X^, G ^2}, 
generated from D rcsamplcs, fall into two categories, "true" or ''null'", depending on whether 
is a true edge or a null edge. Let vr be the proportion of the true edges. We also assume that X^j 
has density fi{x) or /q (a?) if it belongs to the "true" or the "null" categories, respectively. Note 
that both and /g depend on the sample size n but such dependence is not explicitly expressed 
in order to make the notation simple. Then, we have a mixture density for X-j-. 

(2.6) f\x) = (1 - 7r)/o^(x) + TTf^ix), X e {0, 1/B, 2/B, . . . , 1}. 

Based on this mixture model, the (positive) FDR (Storey 2003) of the aggregation-based procedure 

(2.7) FDRiS^) = Pr ((z,j) G E^\i^,J) G S^) = /f • 

Given an estimate FDR{S^) of (2.7), the number of true edges in can be estimated by 

(2.8) Ne{S^) = \S^\ (l - FDR{S^)) . 

This allows us to compare the power of across various choices of c and A, as the total number of 
true edges is constant. Suppose the targeted FDR level is a, we first seek for the optimal threshold 

for each A G A: 

(2.9) c*(A) = min{c : FDRiS^) < a} 
and then we find the optimal regularization parameter 

(2.10) A* = 8.vgm8^^^^NE{S^*^^)) 

-A* 

'c*(A*) 



such that the corresponding procedure 5'i*/^*^ achieves the largest power among all competitors 



with FDR not exceeding a. 

The key of the above procedure for determining optimal c and A is to have good estimates of 
FDR, which in turn requires good estimates of the mixture density and its null-edge contribution 
(1 — 7r)/o . A natural estimator of is simply the empirical selection frequencies, i.e., 
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where Nq = p{p — l)/2 is the total number of candidate edges and = ■ X^j = k/B}\ is 

the number of edges with selection frequencies equal to k/B. 

Before describing an approach to estimating it and /g , we note two observations from Figure 
1(b). First, the contribution from the true edges to the mixture density is small in the range 
where the selection frequencies are small. Second, the empirical distribution of /q is monotonically 
decreasing. These can be formally summarized as the following condition. 

Proper Condition: There exist Vi and V2, < Vi < V2 < 1, such that as n ^ 00, 
CI: f^^O on iVi,V2] ; 

C2: /o is monotonically decreasing on (Fi, 1]. 

This proper condition is satisfied by a class of procedures as described in the lemma below (the 
proof is provided in the Appendix). 

Lemma 1. A selection procedure satisfies the proper condition if, as the sample size increases, 
Pij tends to one uniformly for all true edges and has a limit superior strictly less than one for all 
null edges. 

It is easy to verify that all consistent procedures applying to subsampling resamples satisfy 
the condition in Lemma L Another example is procedures that use randomized lasso penalties 
(Menshausen and Biihlmann, 2010). See Section 2.5 for more details. 

The proper condition motivates us to estimate tt and /g by fitting a parametric model ge for f^ 
in the region (Vi, V2] and then extrapolating the fit to tlie region (V2, 1]. This is because if CI is 
satisfied then (1 — 7r)/o can be well approximated based on the empirical mixture density from the 
region (Vi, V2]. If C2 is also satisfied, the extrapolation of go will also be a good approximation to 
(1 — 7r)/Q on (V2, 1] for a reasonably chosen family of g^. 

For the choice of the parametric family, we propose the following. Given pij , it is natural to model 
the selection frequency to follow a (rescaled) binomial distribution, denoted by bi{-\pij), due to the 
independent and identical nature of resampling. Moreover, we use a powered beta distribution (i.e., 
the distribution of Q'"' where Q ~ beta(a, 6), a,b,r > 0) as the prior for Pij's, denoted by 62 (-1^) 
with 6 = (a, b, r). This is motivated by the fact that the beta family is a commonly used conjugate 
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prior for the binomial family and the additional power parameter 7 simply provides more flexibility 
in fitting. Thus, the distribution of selection frequencies of null edges is modeled as 



he{x)= f' h{x\T)b2{T\9)dT, 

Jo 



and then the null-edge contribution (1— 7r)/Q can be estimated by fitting hg to the empirical mixture 
density in the fitting range {Vi,V2], which, in practice, is determined based on the shape of 
(details are given in Section 2.4). Specifically, we estimate tt and /q by tt and h^, where 

(2.11) (tt, d) = argmin,,,L (^f\-), (1 - 7r)/i,(-)) , 

and L{f,g) = —J2xe{Vi V2]lf(^) di^)] which amounts to minimizing the Kullback-Leibler dis- 
tance. 

2.4. Proper Regularization Range. 

Following what we propose in Section 2.3, we can reasonably evaluate the aggregation-based pro- 
cedure for different choices of (A, c) with regard to model-selection-based criteria: the FDR and 
the number of selected true edges. We may only consider A's that correspond to "U-shaped" em- 
pirical distributions of selection frequencies, i.e., /•^'s that decrease in the small-selection-frequency 
range and then increase in the large-selection-frequency range. The decreasing trend is needed for 
the proper condition to hold while the increasing trend helps to control the FDR, since an with 
FDR< a implies, by (2.7), that 

(2.12) Yl /'(^) ^ ^^"^^^-^'-^"^''^ 

x>c 

Moreover, the increasing trend also helps to obtain decent power since it guarantees a substantial 
size of S^. Based on our experience, the optimal choice of A based on (2.9) and (2.10) indeed always 
corresponds to a "U-shaped" empirical selection frequency distribution. 

Thus, we propose the following simple procedure for identifying "U-shaped" /^'s, which deter- 
mines the proper regularization range in practice. An illustration for this procedure is given in 
Figure 3. 

1. Fit the empirical density by a smooth curve (we use the R-function smooth.splinej. 
Then obtain the empirical valley point V2 = argmin^f^{x). Check if V2 < 0.8. 
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2. Find vi = argmax^^(^Q ,^^^f^{x), i.e., the location of which corresponds to the peak before 
V2. Check if f^ is roughly monotonically decreasing on (^1,^2]. 

3. Let /i = {l + V2)/2 be the middle point of [v2, !]• For f'^, calculate its cumulative distributions 
on [v2tIj\ o.nd [fi, 1], denoted by hi = J2x=v2 f^(^) ^""^ ^2 — J2x=ii f^i^)' respectively. Check 
ifh\ < h^. 

4. The mixture density is recognized as "U-shaped" if it passes all three checks above and the 
corresponding A is considered in the proper range. The interval (^1,^2] is then used as the 
fitting range {Vi,V2] for parameters estimation in (2.11). 

Remark 1. The first criterion is based on our extensive simulation where we find that a large 
value of V2 often corresponds to a too-small X, which yields too many null edges with high selection 
frequencies making (2.12) difficult to hold for reasonably small FDR level a's. The third criterion is 
used to verify the increasing trend of f^ at the tail. If f^is not recognized as "U-shaped" for a large 
range of A 's, we would consider the data as lack of signals where a powerful is not attainable. 
One example is the empty network (see Section 3.2 and Figure S-1 in the Supplementary Material). 

U-Shape Identification 

Empirical o ° 
Smoothed 




V, ' ' v, 

0.0 0.2 0.4 0.6 0.8 1.0 

selection frequency 



Fig 3: An illustration for the proposed U-shape identification procedure. The empirical distribution (f^) is the same as the one 
in Figure 1. The smooth curve (/^) is fitted by the R-function smooth. spline with df = 4. Locations of i^i, V2 and ji are found 
following steps 1-3. 

All together Sections 2.2-2.4 provide a procedure for network inference based on directly estimat- 
ing FDR. We name the procedure as BINCO — Bootstrap Inference for Network Construction, 
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as we suggest to use bootstrap resamples. The main steps are summarized below. 



BINCO Procedure 



Step 1 



Generate the selection frequencies for A G A based on a large number of 



Step 2. 



resamples. 

Find a proper range A* of \ by identifying U-shaped f^'s. (Section 2.4) 



Step 3. For A G A*, find the optimal S^^^^,^ where the threshold c* and the 
regularization A* are determined by (2.9) and (2.10) with FDR estimated 
based on (2.1) and (2.11) (Section 2.3). 



2.5. Randomized Lasso. 

For an Li regularized procedure A(A), our proper condition (Section 2.3) is satisfied if ^(A) is 
selection consistent, which, however, often requires strong conditions. For instance, the well-known 
irrepresentable condition under Lasso regression setting (Zhao and Yu, 2006; Zou, 2006; Yuan and 
Lin, 2007; Wainwright 2009) or the so-called neighborhood stability condition under GGM setting 
(Meinshausen and Biihlmann, 2006). Recently, Meinshausen and Biihlmann (2010) proposed the 
Randomized Lasso which is a procedure based on randomly sampled regularization parameters. 
For example, the randomized lasso version of space would be 



where Wjj's are sampled from a probability distribution p{w) supported on (Z, 1] for some I G (0, 1] 
(note that / = 1 corresponds to the ordinary Li penalty). By perturbing the regularization param- 
eters in a proper way, the irrelevant features may be decorrelated from the true features, such that 
with a positive proportion of configurations of the randomly sampled weights, the irrepresentable 
condition is satisfied. As a result, under conditions less stringent than those required for lasso selec- 
tion consistency, the randomized lasso selects all true features with probability 1 and any irrelevant 
feature with a probability strictly less than 1 (Meinshausen and Biihlmann, 2010, Theorem 2). 
Consequently, the proper condition is satisfied. 

If (2.13) is used as the original (non-aggregated) procedure, an additional parameter I, which 
controls the amount of perturbation of the regularization parameter, needs to be chosen. A small 



(2.13) 
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/ guards better against false positives but damages power, while a large / may result in a liberal 
procedure. Here we provide a two-step data-driven procedure for choosing an appropriate I for 
BINCO. We first fix / = 1, i.e. using the ordinary Li penalty, and find a proper range A* for A 
which correspond to the "U-shaped" empirical mixtures. Then for each A G A*, we consider a set 
of pairs A2 = {(Aj, k), i = 1, . . . , m} such that ^p{w)dw = A, i.e., keeping the average amount 
of regularization unchanged. For example, in the simulation study, we use li = i/10, i = 1, ... ,9. 
We then pick the pair (A*,/*) S A2 such that /* is the smallest among those Vs which correspond 
to U-shaped empirical mixture distributions. Our simulation shows that such a choice of (A*,/*) 
ensures good power for BINCO while controlling FDR in a slightly conservative fashion. 

3. Simulation. 

In this section, we first compare BINCO and stability selection (Meinshausen and Biihlmann, 
2010). We then investigate the performance of BINCO with respect to various factors including the 
network structure, dimensionality, signal strength and sample size. 

We use space (Peng, et al. 2009) coupled with randomized lasso (2.13) as the original non- 
aggregate procedure with the random weights Wij^s generated such that their reciprocals follow 
the uniform distribution C/[l,l//] for / G (0,1]. For method comparison, the selection frequencies 
are obtained from B = 100 subsampling samples of size [n/2]^. For the investigation of BINCO's 
performance, the selection frequencies are generated from B = 100 bootstrap resamples as we 
observe that the performance of BINCO is slightly better under bootstrap resampling than sub- 
sampling. In addition, since for simulations we know the truth, i.e, whether an edge is true or null, 
we introduce the ideal power which is the best power that can be achieved for the selection rule 
= {{i,j) '■ Xf- > c} across different choices of (A, c) as far as the true FDR is controlled at a. 
Based on this ideal power, we can evaluate the method efficiency under different settings. For each 
simulation setting, results reported are based on 20 independent simulation runs. 

3.1. Comparison between BINCO and Stability Selection. 

Stability selection procedure selects 5'^afe/e(*) = • ^^^x&A{X^j) > t}, a set of edges with 

the maximum selection frequency over a pre-specified regularization set A exceeding a threshold t. 
Assuming an exchangeability condition regarding to the irrelevant variables (here the null edges), 

'^Suggested by Meinshausen and Biihlmann (2010) to use for stability selection. 
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Menshausen and Biihlmann (2010, Theorem 1) derived an upper bound for the expected number 
of falsely selected variables for each choice of t > 0.5. Specifically, under suitable conditions, the 

expected number of null edges selected by the set 5'^ab;e(^)' denoted by E{V), satisfies 

(3.1) E(V) < , 

where Nci = p{p — l)/2 is the total number of candidate edges and q\ is the expected number of 
edges selected under at least one A G A. In practice, qa can be estimated by ^ J2f=i I UagA S^{Y^)\. 
Dividing both sides of (3.1) by |'S'^a6/e(*)l' ^® obtain 

(3.2) r^ffL < 



\Si^M - {2t - l)Nn ■ \SUM' 
where the left hand side can be interpreted as an approximation of FDR for the set S^^^^i^ (t) . Then 

for a pre-specified FDR level a, we choose the smallest threshold t such that the upper bound on 

the right hand of (3.2) is no larger than a. 

For data generation, we first consider a power-law network with p = 500 nodes whose degree 
(i.e., the number of connected edges for each node) distribution follows P{k) ~ k~'^. The scaling 
exponent 7 is set to be 2.3, which is consistent with the findings in the literature for biological 
networks (Newman, 2003). There arc in total 495 true edges in this network and its topology is 
illustrated by Figure 5(a). Based on this network, we consider two settings with the same sample size 
n = 200 but different signal strengths. The means and standard deviations of the absolute values 
of non-zero partial correlations for the strong and weak signal settings are (/ii,cri) = (0.34,0.13), 
and (//2jO"2) = (0.25,0.09), respectively. 

We compare the performance of BINCO and stability selection at a targeted FDR level of 0.05. 
For BINCO, we consider Aq = {40, 50, ... , 100} as the initial range for A and then obtain the optimal 
final selection following the steps at the end of Section 2.4. For stability selection, since no specific 
guidance was provided for choosing A and I (the randomized lasso regularization perturbation 
parameter), we consider three different values for I € {0.5,0.8,1} and a collection of intervals 

~ (''^min' ''^max) with A^-^^^^ varying from 40 to 100 and Amax = 100 due to the fact that for 
'^min upper bound in (3.2) can not be controlled at 0.05 for any choice of the threshold 

and the performance of stability selection is largely invariant for Amax- 

When the signals are strong, BINCO gives conservative FDR (0.026) but still good power (0.801) 
(Figure 4(a) and 4(c)). The performance of stability selection varies for different choices of A^j^^^ 
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and I. The FDRs are not controlled for some Aj^^^'s when I = 0.8 and for all Aj^jj^'s when I = 1. For 
other cases, the FDR control is very conservative such that the corresponding power is consistently 
lower than BINCO. When the signals arc weak, stability selection is much more conservative than 
BINCO and results in much lower power (Figure 4(b) and 4(d)). In Table 1, we report the ideal 
power, the power for BINCO and the best power for stability selection (among different choices of 
-^min) under 1=0.5, 0.8 and 1. We also provide the power efficiency, i.e., the ratio of the method 
power over the ideal power, for both methods. It can be seen that the power of BINCO is close to 
the ideal power for both signal strength levels while stability selection is too conservative when the 
signal strength is weak. (For more detailed results, see Supplementary Material, Part Al.) 

Table 1 

Power comparison between BINCO and stability selection under strong and weak signals. 







Ideal^ 


BINCO 


Stability Selection 






1 = 1 Z = 0.8 


/ = 0.5 


Strong Signal 


Power 


0.853 


0.801 


0.818^ 0.785^ 


0.706 


MPE^ 


1 


0.939 


0.959^ 0.920^ 


0.828 


Weak Signal 


Power 


0.616 


0.569 


0.434 0.407 


0.170 


MPE^ 


1 


0.924 


0.705 0.661 


0.276 



^ "Ideal" refers to the ideal power that can be achieved when the true distri- 



bution of null edges is known. 
^ Method Power Efficiency (MPE) = method power / ideal power. 
3 FDR control failed. 
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= 0.05 \ 








\ stab. Sel.. 1 = 0.8 


BINCO FDR = 


0.026 
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Controlled FDR = 0.05 



BINCO FDR = 0.054 




Stab. Sel.. I = 1 
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(a) 



(b) 



Power. Strong Signal 



Power. Weak Signal 



BINCO Power = 0.801 




: - BINCO Power = 0.569 



o 
a. 



stab. Sel.. I = 1 




40 50 60 70 80 90 100 



Stab. Sel.. I = 0.5 



50 55 60 65 70 75 80 



(c) 



(d) 



Fig 4: The FDR (top panels) and power (bottom panels) for BINCO and stability selection (Stab. Sel.). (a) and (c) are for the 
strong signal setting; (b) and (d) are for the weak signal setting. 



3.2. Further Investigation on BINCO. 

Now we investigate the effects of the network structure, dimensionahty, signal strength and 
sample size on the performance of BINCO. 
Network Structure 

We consider four different network topologies: empty network, power-law network, empir- 
ical network and hub network. In each network, there are five disconnected components with 
100 nodes each. Below is a brief description of the network topologies. 



1. Empty network: there is no edge connecting any pair of nodes. 
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2. Power-law network: the degree follows a power-law distribution with parameter 7 = 2.3 
as described in Section 3.1 (Figure 5(a)). 

3. Empirical network: the topology is simulated according to an empirical degree distribution 
of one genetic regulatory network (Schadt et al. 2005) (Figure 5(b)). 

4. Hub network: three nodes per component have a large number of connecting edges (> 15) 
and all other nodes have a small number of connecting edges (< 5) (Figure 5(c)). 

We set the sample size n = 200. The signal strength for all networks except for the empty network 
is fixed at the strong level as in Section 3.1. 

For the empty network, the empirical mixture distributions of selection frequencies are mono- 
tonically decreasing on a wide range of A (Figure S-1) which are not recognized by BINCO as 
"U-shaped". Thus, we reach the right conclusion that there is no signal in this case. In contrast, 
data sets from the other three networks produce the desired "U-shaped" mixture distributions for 
some A (Figure S-2). 

We compare BINCO results across networks 2-4 with FDR controlled at level a = 0.05 and 0.1. 
BINCO gives slightly conservative control on FDR and achieves reasonable power for all three net- 
works (Figure 6). The comparison to the ideal power shows that the network topologies investigated 
here have only a small effect on BINCO's efficiency (Table 2). 




(a) (b) (c) 

Fig 5: Different network topologies: (a) Power-law Network, number of true edges = 495; (b) Empirical Network, number of 
true edges=633; (c) Hub Network, number of true edges=587. All three networks have p = 500 nodes. 
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^ level = 0.06 
M level = 0.10 



IM level = 0.05 
^ level =0.10 



Power-law Empirical 



(a) 



Power-law Empirical 



(b) 



Fig 6: True FDR (a) and Power (b) for BINCO at targeted FDR level 0.05 (red) and 0.1 (green) for the Power-law, Empirical, 
and Hub networks, respectively. Heights of the bars represent the means over 20 simulated data sets with whiskers indicating 
one standard deviations. 



Table 2 

Comparison of BINCO power and ideal power under different networks. 





Controlled FDR = 


0.05 


Controlled FDR = 


0.10 


Topology 


Power-law Empirical 


Hub 


Power-law Empirical 


Hub 


BINCO Power 
Ideal Power 
MPE^ 


0.810 0.523 
0.856 0.595 
0.946 0.879 


0.644 
0.736 
0.875 


0.845 0.565 
0.881 0.631 
0.959 0.895 


0.692 
0.776 
0.892 



Method Power Efficiency (MPE) = method power / ideal power. 



Dimensionality 

We investigate the impact of dimensionality on the performance of BINCO. We consider the 
power-law network and let the number of nodes p vary from 500, 800 to 1000. To keep the complexity 
of each component the same across different choices of p, we set the component size constant, 100, 
and the number of components C = p/WO. Sample size n = 200 is used for all three cases. Again, 
the signal strength is fixed at the strong level as in Section 3.1. 

For all three choices of p, BINCO performs similarly (Figure 7), with slightly conservative FDR 
and power around 0.8. The dimensionality does not show a significant impact on BINCO. This is 
probably due to each component is fixed on complexity. BINCO's result is also largely invariant 
when we compare networks of different number of components with p fixed (such that component 
size varies, see the Supplementary Material, Part A3). 
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level = 0.06 
level = 0.10 



13 level = 0.05 
M level = 0.10 



P=800 

Dimension 



P=800 

Dimension 



(a) 



(b) 



Fig 7: True FDR (a) and Power (b) for BINCO at targeted FDR level 0.05 (red) and 0.1 (green) when p = 500, 800, and 
1000, respectively. Heights of the bar represent the means over 20 simulated data sets with whiskers indicating one standard 
deviations. 



Signal Strength 

We consider three levels of signal strength, namely, strong, weak and very weak. The corre- 
sponding means and standard deviations of the absolute values of non-zero partial correlations are 
(^i,cji) = (0.34,0.13), (/i2,o-2) = (0.25,0.09) and (^3,cr3) = (0.21,0.07), respectively. The network 
is the power-law network with p = 500 and sample size is n = 200 for all settings. 

BINCO gives good control on FDR, however the power decreases from 0.8 to 0.3 as the signal 
weakens (Figure 8). Comparing the power of BINCO with the ideal power (Table 3), we see that 
BINCO remains efficient and the loss in power is largely due to the reduction of signal strength. 



Table 3 

Power Comparison of BINCO power and ideal power when the signal strength is 
strong, weak and very weak. 





Controlled FDR = 0.05 


Controlled FDR = 0.10 


Signal Strength 


Strong 


Weak Very Weak 


Strong Weak Very Weak 


BINCO Power 
Ideal Power 
MPE^ 


0.810 
0.856 
0.946 


0.579 0.252 
0.615 0.279 
0.941 0.903 


0.845 0.617 0.310 
0.881 0.651 0.345 
0.959 0.948 0.899 



Method Power Efficiency (MPE) = method power / ideal power. 
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Fig 8: True FDR (a) and Power (b) for BINCO at targeted FDR level 0.05 (red) and 0.1 (green) when the signal strength is 
strong, weak and very weak, respectively. Heights of the bars represent the means over 20 simulated data sets with whiskers 
indicating one standard deviations. 



13 level = 0.05 
El Iev6l = 0.10 



13 level = 0.05 
Ellevel=0.10 



500 

Sample Size (n) 



500 

Sample Size (n) 



(b) 



Fig 9: True FDR (a) and Power (b) for BINCO at targeted FDR level 0.05 (red) and 0.1 (green) when the sample size n = 200, 
500 and 1000, respectively. Heights of the bars represent the means over 20 simulated data sets with whiskers indicating one 
standard deviations. 



Sample Size 

Now we consider the impact of sample size n by varying it from 200, 500 to 1000, while keeping 
the signal strength at the "very weak" level as in the previous simulation. The network structure 
is again the power-law network with p = 500. 

With an increased sample size, the power of BINCO is significantly improved from 0.3 to nearly 
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0.9 (Figure 9(b)) while the FDRs are well controlled (Figure 9(a)). 

In summary, BINCO has good control for FDR under a wide range of scenarios. Its performance 
is shown to be robust for networks with different topologies and dimensionalities, and its efficiency 
is not influenced much even when the signal strength is weak. As the sample size increases, the 
power of BINCO is improved significantly. More detailed results can be found in the Supplementary 
Material, Part A2. 

4. A Real Application. 

We apply the BINCO method to a microarray expression data set of breast cancer (BC) (Loi, et 
al, 2007) to build a gene expression network related to the disease. The data (http://www.ncbi. 
nlm.nih.gov/geo/query/acc. cgi?acc=GSE6532) contains measurements of expression levels of 
44928 probes in tumor tissue samples from 414 BC patients based on the Affymetrix Human 
Genome U133A, U133B and U133 plus 2.0 Microarray platforms. 

We preprocess the data as follows. First, a global normalization is applied by centering the me- 
dian of each array to zero and scaling the Median Absolute Deviation (MAD) to one. Probes with 
standard deviation (SD) greater than the 25%-trimmed mean of all SDs are selected. We further 
focus on a subset of 1257 probes for genes from cell cycle and DNA-repair related pathways (http : 
//peiwang . f here . org/ internal/papers/DNArepair_CellCircle_r elated . csv/ view) , as these 
pathways have been shown to play significant roles in breast cancer tumor initiation and develop- 
ment. Clinical information including age, tumor size, ER-status (positive or negative) and treatment 
status (tamoxifen treated or not) is incorporated in the analysis as "fake genes" since we are also 
interested in investigating whether/how gene expressions are associated with these outcomes. Fi- 
nally we standardize each expression level to have mean zero and SD one. The resulting dataset 
for resampling has p = 1261 genes/probes (including four clinical variables) and n = 414 tumor 
samples. 

We generate selection frequencies by applying the space algorithm with randomized lasso regu- 
larization to B = 100 bootstrap resamples. The initial range of the tuning parameter A is set to 
be A = (100, 120, . . . , 580). We then apply the BINCO procedure and find that the optimal values 
for the regularization parameters are A = 340 and / = 0.9. The empirical distribution of selection 
frequencies of all edges and the null density estimation are given in Figure 10. When the estimated 
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FDR is controlled at 0.05, 0.1 and 0.2, BINCO identifies 125, 222 and 338 edges, respectively. The 
estimated network for FDR=0.2 is shown in Figure 11. In this figure, two components of large con- 
nectivity structure are observed which contain most of the high-degree genes, i.e., genes connected 
by a large number of high-selection-frequency edges. Note that for the four clinical variables, the 
only high-selection-frequency edge is the one between age and ER-status (with selection frequency 
= 0.96). All edges between one clinical variables and genes/probes are insignificant (with selection 
frequencies less than 0.12). 

Networks built on perturbed data sets can also be used to detect hub genes (i.e., highly connected 
genes) which are often of great interest due to the central roles these genes may play in genetic reg- 
ulatory networks. The idea is to look for genes which show consistent high connection in estimated 
networks across perturbed data sets. Here, we propose to detect hub genes by the ranks of their 
degrees based on the estimated networks using A = 340 and I = 0.9. The ten genes with the largest 
mean as well as the smallest standard deviation of degree rank across 100 bootstrap resamples 
(Figure 12) are MBD4, TARDBP, DDB2, MAP3K4, 0RC3L, CDKNIB, REL, ATR, LGMN and 
CDKN3. Eight out of these ten genes have been reported relevant to BC, while the role of the other 
two (TARDBP, 0RC3L) in BC is not clear at present and might lead to further investigation. Note 
that all of these ten genes also appear to be hubs in network estimated by BINCO at FDR of 0.2 
(Figure 11, blue-symbol-labeled). More details for these hub genes are given in the Supplementary 
Material, Part B. 
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Fig 10: The empirical selection frequency distribution of all edges (dots) and the estimated selection frequency distribution of 
null edges (solid line). The three vertical lines are drawn at the cut-offs CI = 0.98, C2 = 0.93 and C3 = 0.85 for FDR at 0.05, 
0.1 and 0.2, respectively. 
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Fig 11: Inferred networks at FDR=0.2 from the breast cancer expression data. A total of 338 edges are identified. Genes with 
degree> 3 are labeled by their symbols; genes with degree> 4 are indicated by red nodes. In addition, the top ten genes with 
consistently high connection in estimated networks across perturbed data sets are labeled in blue symbols. 
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Fig 12: A scattcr-plot of the standard deviation (SD) of degree ranks (small to large) versus the mean of degree ranks (large to 
small) of each gene/probe. The solid circles are the top 10 genes. 



5. Summary. In this paper, we propose the BINCO method to conduct inference for high- 
dimensional Gaussian graphical models. BINCO employs model aggregation strategies and selects 
edges by directly estimating the FDR. This is achieved by modeling the selection frequency dis- 
tribution using a mixture model. A flexible parametric distribution is used to model the density 
corresponding to the null edges. By doing this, BINCO is able to provide a good control on the 
FDR. Moreover, BINCO identifies the U-shape characteristic of empirical selection frequency dis- 
tribution as a criterion for proper amount of regularization. Extensive simulation results show that 
BINCO perform well under a wide range of scenarios, indicating that it can be used as a practical 
tool for network inference. Although we focus on the COM construction problem in this paper, 
BINCO is applicable to a wide range of problems where model selection is needed as it provides a 
general approach modeling the selection frequencies. 



APPENDIX A: PROOF OF LEMMA 1 

Proof. Suppose as the sample size n increases, an edge selection procedure A{X) gives selection 
probabilities {p\^^ } (with respect to resample space) which uniformly satisfy 



(A.1) 



(n) 



1, if{i,j)eE 
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and 

(A.2) limsuppj"^ < M < 1, if e E". 

Suppose B is large such that ^^^M < 1. Let X be a random sample from the set of selection 
frequencies {X^^} generated by applying ^(A) on B resamples, i.e., Pr{X = X^^) = 1/Nq, G Jl. 
Also suppose X has density f^^ if X = Xf-. Then the mixture model (2.6) becomes 

(A.3) Ax) = (l-7r)/o^(x) + 7rA^(x)= + E 

with (1 - 7r)/o^(x) = E(,j)e£;c ^/^(x) and 7rA^(a;) = 

Because of the i.i.d. nature of resamples given the data, /j^- is a binomial density with jo-J"* as 
the probability of success, i.e., f^j{x) = {^){Pij^)^{^ - Pij^)^~^ for x = k/B, k = 0,1, . . . , B. This 
bmomial density is monotone decreasing for x greater than its mode m., = ^^^^ or ^E^Dftl. 
By (A.2), given Vi = < 1 and e > such that Vi + e < 1, 3N such that for all ra > 

max(jj)g£;c()Uij) < Vi + e and hence for any null edge G E'^, flj{x) is monotone decreasing 
on [Vi + e, 1], which implies C2 since f^{x) = (j-^^ E(tj)e£;<= /^C^)- Also, (A.l) implies, for 
G E, fij{x) —7- uniformly for x < 1, which implies CI for any V2 < 1. Taking V2 such that 
Vi < V2 < 1 satisfies the proper condition and completes the proof. □ 
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Part A: Additional Simulation Result 

Al: Method Comparison 

Here are more detailed result for the comparison between BINCO and stability selection. The 

simulation setting is the same as that in Section 3.1. Tables S-1 to S-6 are the results for stability 

selection and Table S-7 is the result for BINCO. 

Table S-1 

Power and FDR of Stability Selection, I = 1, Xmax = 100, Strong Signal. 





Controlled FDR = 


0.05 


Controlled FDR = 


0.10 




FDR 


Power 


FDR 


Power 




Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean 


SD 


< 40 




no solution 






no solution 




50 


0.061 ^ 


0.008 


0.818 


0.013 


0.077 


0.009 


0.836 


0.012 


60 


0.060 ^ 


0.007 


0.783 


0.011 


0.063 


0.008 


0.790 


0.011 


70 


0.054 1 


0.007 


0.725 


0.010 


0.055 


0.007 


0.729 


0.012 


80 


0.050 


0.007 


0.663 


0.010 


0.051 


0.006 


0.666 


0.011 


90 


0.050 


0.006 


0.567 


0.014 


0.050 


0.006 


0.572 


0.014 


100 


0.062 1 


0.008 


0.418 


0.006 


0.061 


0.007 


0.427 


0.016 



FDR control failed. 

Table S-2 

Power and FDR of Stability Selection, I = 0.8, Xmax = 100, Strong Signal. 





Controlled FDR = 


0.05 


Controlled FDR = 


0.10 




FDR 


Power 


FDR 


Power 


^min 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean 


SD 


40 




no solution 




0.099 


0.012 


0.823 


0.011 


50 


0.077 ^ 


0.009 


0.785 


0.011 


0.091 


0.012 


0.797 


0.011 


60 


0.056 ^ 


0.009 


0.734 


0.011 


0.059 


0.009 


0.739 


0.012 


70 


0.033 


0.008 


0.668 


0.010 


0.036 


0.009 


0.675 


0.009 


80 


0.017 


0.006 


0.568 


0.012 


0.018 


0.007 


0.574 


0.013 


90 


0.010 


0.007 


0.400 


0.015 


0.010 


0.008 


0.408 


0.015 


100 


0.005 


0.007 


0.207 


0.001 


0.007 


0.008 


0.214 


0.010 



FDR control failed. 



Table S-3 

Power and FDR of Stability Selection, I = 0.5, Xmax = 100, Strong Signal. 



40 
50 
60 
70 

> 80 



Controlled FDR = 0.05 



FDR 

Mean 

0.007 
0.007 
0.001 




Power 



SD 


Mean 


SD 


0.005 


0.706 


0.011 


0.004 


0.662 


0.009 


0.002 


0.526 


0.017 





0.271 


0.012 



Controlled FDR = 0.10 



FDR 

Mean 

0.021 
0.010 
0.002 


Omitted for too small power. 



Power 



SD 


Mean 


SD 


0.006 


0.747 


0.006 


0.005 


0.676 


0.005 


0.003 


0.540 


0.003 





0.287 


0.013 
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Table S-4 

Power and FDR of Stability Selection, 1 = 1, Xmax = 100, Weak Signal. 





Controlled FDR = 


0.05 


Controlled FDR = 0.10 




FDR 


Power 


FDR 


Power 


^min 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean 


SD 


< 40 




no solution 






no solution 




50 




no solution 




0.006 


0.006 


0.490 


0.020 


60 


0.003 


0.004 


0.434 


0.017 


0.004 


0.004 


0.458 


0.013 


70 





0.002 


0.288 


0.017 


< 0.001 


0.001 


0.298 


0.018 


80 








0.126 


0.015 








0.131 


0.014 


> 90 






Omitted for t 


oo small 


power. 







Table S-5 

Power and FDR of Stability Selection, I = 0.8, Xmax ■ 



100, Weak Signal. 



Xr 



Controlled FDR = 0.05 

FDR Power 
Mean SD Mean SD 



Controlled FDR = 0.10 

FDR Power 
Mean SD Mean SD 



40 

50 0.002 

60 0.001 
70 

> 80 



no solution 
0.003 0.407 
0.003 0.328 
0.143 



0.026 
0.016 
0.016 



0.006 
0.001 




no solution 
0.006 0.485 
0.003 0.342 
0.149 



0.016 
0.014 
0.015 



Omitted for too small power. 



Table S-6 

Power and FDR of Stability Selection, I = 0.5, Xmax = 100, Weak 

Signal. 



^min 


Controlled FDR = 0.05 
FDR Power 
Mean SD Mean SD 


Controlled FDR = 0.10 
FDR Power 
Mean SD Mean SD 


40 

50 
60 

> 70 


no solution 
no solution 
0.006 0.012 

Omitted for t 


no solution 
0.001 0.002 0.170 0.020 
0.025 0.011 

oo small power. 



Table S-7 
Power and FDR of BINCO 





Controlled FDR = 0.05 


Controlled FDR = 0.10 


Signal 


FDR 


Power 


FDR 


Power 


Strength 


Mean SD 


Mean SD 


Mean SD 


Mean SD 


Strong 


0.026 0.016 


0.801 0.023 


0.056 0.025 


0.835 0.016 


Weak 


0.034 0.011 


0.569 0.032 


0.059 0.017 


0.610 0.028 



A2: Further Investigation of BINCO 's Performance 
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0.0 0.4 0.8 
selection frequency 



0.0 0.4 0.8 
selection frequency 




0.0 0.4 0.8 
selection frequency 



0.0 0.4 0.8 
selection frequency 



Fig S-1: Diagnostic on the empirical distribution of selection frequencies from empty network: no 
observed for A in a wide range. 



'U-shape" characteristic is 




0.0 0.4 0.8 
selection frequency 



0.0 0.4 0.8 
selection frequency 



0.0 0.4 0.8 
selection frequency 



0.0 0.4 0.8 
selection frequency 



Fig S-2: Diagnostic on the empirical distribution of selection frequencies from power-law network; "U-sliape" characteristic 
is observed for A in a wide range. Note the "U-shapc" characteristic is also observed for the empirical selection frequency 
distributions from empirical and hub networks. Those diagnostic plots are very similar to this one and hence omitted. 
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Table S-8 

Investigation of BINCO on Networks of Different Topologies 





Controlled FDR = 


0.05 


Controlled FDR = 0.10 


Network 


FDR 


Power 


FDR 


Power 


Topology 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean SD 


Power-law 


0.046 


0.009 


0.810 


0.013 


0.096 


0.013 


0.845 0.013 


Empirical 


0.032 


0.019 


0.523 


0.040 


0.068 


0.034 


0.565 0.040 


Hub 


0.023 


0.009 


0.644 


0.021 


0.052 


0.012 


0.692 0.017 



Table S-9 

Investigation of BINCO on Power-law Networks of Different Dimensionality 





Controlled FDR = 


0.05 


Controlled FDR = 0.10 


Dimension 


FDR 


Power 


FDR 


Power 


(P) 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean SD 


500 


0.046 


0.009 


0.810 


0.013 


0.096 


0.013 


0.845 0.013 


800 


0.030 


0.007 


0.769 


0.010 


0.083 


0.010 


0.811 0.012 


1000 


0.043 


0.007 


0.784 


0.008 


0.096 


0.011 


0.821 0.007 



Table S-10 

Investigation of BINCO on Power-law Networks of Different Signal Strength 





Controlled FDR = 


0.05 


Controlled FDR = 0.10 


Signal 


FDR 


Power 


FDR 


Power 


Strength 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean SD 


Strong 


0.046 


0.009 


0.810 


0.013 


0.096 


0.013 


0.845 0.013 


Weak 


0.032 


0.010 


0.579 


0.024 


0.063 


0.014 


0.617 0.018 


Very Weak 


0.035 


0.026 


0.252 


0.040 


0.065 


0.037 


0.310 0.039 



Table S-11 

Investigation of BINCO on Power-law Networks of Different Sample Size 
with Very Weak Signal Strength 





Controlled FDR = 


0.05 


Controlled FDR = 0.10 


Sample 


FDR 


Power 


FDR 


Power 


Size 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean SD 


200 


0.035 


0.026 


0.252 


0.040 


0.065 


0.037 


0.310 0.039 


500 


0.024 


0.010 


0.684 


0.012 


0.049 


0.011 


0.714 0.014 


1000 


0.015 


0.010 


0.869 


0.013 


0.090 


0.01,5 


0.891 0.012 



A3: Additional Simulation for BINCO. 



Here we investigate the impact of the number of components in the networks on BINCO 's 
performance. We consider the power-law network with sample size n = 200 and the number of 
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nodes p = 500. Again, the signal strength is fixed at the strong level as in Section 3.1. Networks 
are generated by varying the number of components C = 5, 2 to 1. 

Since components are independent, the model dimensionality is the size of each component which 
is smaller for larger C. Thus, as the number of components decreases, we might expect it is more 
challenging to detect the network due to the increasing dimensionality. However, for networks with 
different number of components, BINCO provides similar (and slightly conservative) control for 
FDR (Figure S-3(a)) and the power is also similar (about 0.8, see Figure S-3(b)). Detailed results 
are summarized in Table S-12. 



level = 0.05 
ESI level = 0.10 



number of clusters 



(a) 



1^ level = 0.05 
Ellevel=aiO 



number of clusters 



(b) 



Fig S-3: True FDR (a) and Power (b) for BINCO at targeted FDR level 0.05 (red) and 0.1 (green) when the number of 
components in the network is five, two and one, respectively. Heights of the bar represent the means over 20 simulated data 
sets with whiskers indicating one standard deviation. 



Table S-12 

Investigation of BINCO on Power-law Networks with Different Number of 

components 





Controlled FDR = 


0.05 


Controlled FDR = 0.10 


Number of 


FDR 


Power 


FDR 


Power 


components 


Mean 


SD 


Mean 


SD 


Mean 


SD 


Mean SD 


5 


0.046 


0.009 


0.810 


0.013 


0.096 


0.013 


0.845 0.013 


2 


0.048 


0.012 


0.783 


0.011 


0.096 


0.016 


0.814 0.011 


I 


0.039 


0.015 


0.804 


0.011 


0.095 


0.020 


0.836 0.013 



imsart-aoas ver. 2009/08/13 file: BINCO_arXiv.tex date: November 23, 2011 



BINGO 



33 



Part B: Details of the Hub Genes Detected by Applying BINCO on Real Data 



Table S-13 

Annotation of Hub Genes and Their Connections to breast cancer BC. 



Rank of 
Degree ^ 


Gene 
Symbol 


Pathway 
belongs to 


Function Summary 


Connection to Breast Cancer 


1 


MBD4 


Brontani-DNA-Methylation- 


encoding motliyl-CpG 


ovcr-cxprcsscd and amplified in hu- 






and-Modification; 


binding domain protein 4. 


man BC (Zhu, et al, 1999) and dif- 






DNA-Binding 




ferentially expressed in mammary 

epithelial cells (Jiang, et al., 2010); 


2 


TARDBP 


DNA-Binding 


encoding TAR DNA binding 
protein. 


needs further investigation. 


3 


DDB2 


Brentani- Repair; 


encoding damage-specific 


highly expressed in the human ER- 






DNA-Damage-Signaling; 
Damagcd-DNA-Binding; 


DNA binding protein 2, 
48kDa. 


positive breast tumor samples and 

plays a significant role as an acti- 






DNA-Binding 




vator of BC cell growth (Kattan, et 
al, 2008). 


4 


MAP3K4 


Brentani-Signaling 


encoding mitogen-activated 
protein kinase kinase kinase 
4. 


plays a role in the signal transduc- 
tion pathways of breast cancer cell 
proliferation, survival, and apopto- 
sis (Bild and Johnson, 1998). 


5 


0RC3L 


Cell-Cycle-KEGG; 

CcU-CycIe; 

Gl-to-S-Cell- 

Cycle-Reactomc; 

DNA- Replication- Reactome; 

HSA04110-Cell-Gycle; 

DNA-Binding 


encoding origin recognition 
complex, subunit 3-like 
(yeast). 


needs further investigation. 


6 


GDKNIB 


Brentani-Gell-Cycle; 


encoding cyclin-dcpcndcnt 


an essential regulators of coll cycle 






Cell-Cycle- Regulator; 


kinase inhibitor IB (p27, 


progression where its genetic vari- 






Cell-Cycle; 


Kipl), controlling the cell 


ants has been verified to be associ- 






Gl-to-S-Cell-Cycle- Reactome; 


cycle progression at Gl. 


ated to BC risk (Ma, et al, 2006 






CcU-Cyclc- Arrest; 
HSA04110-Cell-Cycle 




and Canbay, et al, 2010). 


7 


REL 


Brentani-Signaling 


encoding c-Rel, a 
transcription factor that is 
a member of the Rel/NFKB 
family. 


may be involved early in the 
progression of breast epithelial 
cells towards malignancy (Romieu- 
Mourez, et al, 2001). 


8 


ATR 


DNA-Damage-Signaling; 


encoding atajcia 


there are potential interaction 






Cell-Cycle-Gheckpoint-II; 


telangiectasia and Rad3 


effects of variations in 






HSA04110-Cell-Gycle 


related 


ATM/ATR/BRCA1/BRCA2 genes 
for breast cancer (Wang, et al, 
2010). 


9 


LGMN 


DNA-Damage-Signaling 


encoding legumain, a 
cysteine protease that has a 
strict specificity for 
hydrolysis of asparaginyl 
bonds. 


may play a role in tumor progres- 
sion and is important in prognostic 
for BC (Gawenda, et al, 2007). 


10 


GDKN3 


Cell-Cycle- Arrest 


encoding cyclin-dependent 
kinase inhibitor 3. 


over-expressed in and hence associ- 
ated with breast and prostate mar- 
lignancies (Lee, et al, 1999). 



The rank of the number of connected edges (from the largest to the smallest) for each gene based on the estimated network by BINCO. 
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Part C: Examples When pij And pij Are Close 

Example 1. An original procedure with subsampling. 

Since subsampling Y^'^^^ of size m from a random sample Y(^j^-j is equivalent to directly sampling 

a random sample y(m) of size m, the asymptotic behavior of pij and pij should be the same. In 
particular, if p\j^ has a limit, then converges to the same limit and hence p\j ^ — p\™'^ — > as 
m, n — >■ oo. 

Example 2. Selection consistent Lasso under linear regression settings. 
Consider a linear regression model 

Y = X/3 + e 

where, for sample size n, Y is an n x 1 response, X = {Xi, . . . ,Xp) is the n x p design matrix 
and € is the random error with mean and covariance 1. P is the coefficient vector that needs to 
estimate. 

Denote the covariance matrix o/ X by C = £^(X'X) and write C as 



C 



Cll Ci2 

C21 C22 

where Cn is the covariance matrix of relevant variables in X, C22 is the covariance matrix of 
irrelevant variables in X and C12 = C21 is the matrix of covariance between relevant and irrelevant 
variables in X. Whenp is fixed, the selection consistency of the Lasso procedure is equivalent to the 
irrepresentable condition (Zhao and Yu, 2006) 

(S-1) \Ci2{Cii)-hign{m)\ < 1 - »? 

where /3(1) is the non-zero coefficients for the relevant variables in the linear model, r] is a positive 
constant vector and sign{-) maps positive entry to 1, negative entry to -1 and zero to zero. Denote 
the Lasso estimator of P by P and the one based on bootstrap data by p. Also define the event 

S = {sign{P) = sign{P)} 

and 

S = {sign{P) = sign{P)}. 
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Note that ~ is used to represent the counterpart in the bootstrap sample space to that in the sample 
space. It can he shown that P{S) — >■ 1 implies P{S) — >■ 1, i.e., the Lasso procedure is also consistent 
on bootstrap resample data. Thus, denote the selection probability of the i''^ feature w.r.t. the sam,ple 
space by Pi and that w.r.t. the bootstrap resample space by pi, then pi and pi converge to the same 
limit (1 or 0, depending on whether the i^^ feature is a true or irrelevant one) for all 1 < i < p. 
Details for showing P{S) — >■ 1 are given below. 

We use the notation consistent with Zhao and Yu (2006). First we see that, under the finite- 
moment assumption of X, both the sample covariance and bootstrap resample covariance C"' 
converge to the same limit C (Arenal- Gutierrez, et al. 1996), which means the Proposition 1 in 
Zhao and Yu (2006) can be applied to the bootstrap resample data. Then, 

1 - p{s) <Y^p(\i^\ > v^m - (ici > ^^.) 

where (zf , . . . , zj, Cf, • ■ • , Cplq)' = D'^W'^ with 



= X'e/^n, and b = (b^, . . . , 6"^) = iC\y^signf3{l) . 
Denote the counterpart ofW^ and w.r.t. the sample space by and D^ . Note that — t-^ 
N{0,C) and by Theorem 2.2 of Bickel and Freedman (1981) W'^-W -^d N{0,C). Also note that 
f)n_j^n^Q D"" ^ D where 



D 



By the Slutsky's Theorem, 



(Cii)-i 

C2l(Cii)-l -1 

D^W"" D ■ iV(0, C) 



and 

B^^^ri _ jjny^n ^ f)n^y^rn _ ^rn^ + ^jjn _ D ■ N{0, C), 

which implies that zf — zf and zf have the same limiting distribution. Thus, for A„ such that 
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l+c 



^nl'n — >■ 0, A„/n 2 — >■ oo with < c < 1, 

9 



(S-2) < ^[p(^|4^-^f|>l 



An 

2^' 



6" 



(S-3) 



o(e-" ), 



where (S-3) uses the result from the proof of Theorem 1 in Zhao and Yu (2006) while for (S-2) it 
is because P{Zi + Z2 > t) < P{max{Zi, Z2) + max{Zi, Z2) > t) = P{Zi > t/2 or Z2 > t/2) < 
P{Zi > t/2) + P{Z2 > t/2). Similarly, it can he shown that 

p-q 



Therefore, P{S) 1. 

Analogues to the above, it can be shown that lasso is consistent w.r.t. both the sample and the 
bootstrap resample space under (S-1) and additional regularity conditions when p is allowed to grow 
as n grows. 

Example 3. Consistent space procedure (Peng, et al. 2009) for network construction. 

Similar to the irrepresentable condition for the lasso regression case, the selection consistency 
for the space procedure is implied by a condition imposed on the second derivative of the objective 
loss function which converges to the same limit for both sample data and bootstrap resample data. 
Under such condition and additional regularity conditions, the probability of space procedure being 
inconsistent based on bootstrap resamples can be upper bounded by a small number in a similar way 
as bounding that based on samples, which implies that the space procedure is also consistent w.r.t. 
the bootstrap resmaple space and hence pij — Pij for all {i,j) G ^■ 
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